ML-estimation based on mixtures of Normal distributions is a
widely used tool for cluster analysis. However, a single outlier can 
make the parameter estimation of at least one of the mixture com- 
ponents break down. Among others, the estimation of mixtures of 
t-distributions by McLachlan and Peel [Finite Mixture Models (2000) 
Wiley, New York] and the addition of a further mixture component 
accounting for "noise" by Fraley and Raftery [The Computer J. 41 
(1998) 578-588] were suggested as more robust alternatives. In this 
paper, the definition of an adequate robustness measure for cluster 
analysis is discussed and bounds for the breakdown points of the 
mentioned methods are given. It turns out that the two alternatives, 
while adding stability in the presence of outliers of moderate size, do 
not possess a substantially better breakdown behavior than estima- 
tion based on Normal mixtures. If the number of clusters s is treated 
as fixed, r additional points suffice for all three methods to let the 
parameters of r clusters explode. Only in the case of r = s is this not 
possible for t-mixtures. The ability to estimate the number of mixture
components, for example, by use of the Bayesian information crite- 
rion of Schwarz [Ann. Statist. 6 (1978) 461-464], and to isolate gross 
outliers as clusters of one point, is crucial for an improved breakdown 
behavior of all three techniques. Furthermore, a mixture of Normals 
with an improper uniform distribution is proposed to achieve more 
robustness in the case of a fixed number of components. 



1. Introduction. ML-estimation based on mixtures of Normal distribu- 
tions (NMML) is a flexible and widely used technique for cluster anal- 
ysis [e.g., Wolfe (1967), Day (1969), McLachlan (1982), McLachlan and 
Basford (1988), Fraley and Raftery (1998) and Wang and Zhang (2002)]. 
Moreover, it is applied to density estimation and discrimination [Hastie 
and Tibshirani (1996) and Roeder and Wasserman (1997)]. Banfield and
Raftery (1993) introduced the term "model-based cluster analysis" for such 
methods. 

Observations $x_1,\ldots,x_n$ are modeled as i.i.d. with density

(1.1)  $f_\eta(x) = \sum_{j=1}^{s} \pi_j \varphi_{a_j,\sigma_j^2}(x),$

where $\eta = (s, a_1,\ldots,a_s, \sigma_1,\ldots,\sigma_s, \pi_1,\ldots,\pi_s)$ is the parameter vector, the
number of components $s \in \mathbb{N}$ may be known or unknown, $a_j \in \mathbb{R}$, $\sigma_j > 0$,
$\pi_j \ge 0$, $j = 1,\ldots,s$, $\sum_{j=1}^{s} \pi_j = 1$ and $\varphi_{a,\sigma^2}$ denotes the density of a Normal
distribution with mean $a$ and variance $\sigma^2$, $\varphi = \varphi_{0,1}$. Mixtures of multivariate
Normals are often used, but for the sake of simplicity, considerations are 
restricted to the case of one-dimensional data in this paper. The results 
essentially carry over to the multivariate case. 

As in many other ML-techniques that are based on the Normal distribu- 
tion, NMML is not robust against gross outliers, in particular, if the num- 
ber of components s is treated as fixed: the estimators of the parameters 
$a_1,\ldots,a_s$ are weighted means of the observations. For each observation, the
weights sum up to 1 [see Redner and Walker (1984)], which means that at 
least one of these parameters can become arbitrarily large if a single extreme 
point is added to a dataset. 
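
The argument can be checked numerically. The following minimal Python sketch
computes, for a single observation, its weights in the weighted means of a
Normal mixture. The function name `posterior_weights` and all numerical
values are illustrative assumptions and do not appear in the cited
literature; the computation is carried out on the log scale so that an
extreme observation does not cause numerical underflow.

```python
import math

def posterior_weights(x, pis, locs, scales):
    """Weights of observation x in the weighted means of a Normal mixture.

    The common factor 1/sqrt(2*pi) cancels in the normalization and is
    therefore omitted. Assumes all pis > 0 and all scales > 0.
    """
    logs = [math.log(p) - math.log(s) - 0.5 * ((x - a) / s) ** 2
            for p, a, s in zip(pis, locs, scales)]
    top = max(logs)                      # subtract the maximum for stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical example: two components at 0 and 5, one gross outlier.
weights = posterior_weights(1e6, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

However far the observation lies from all components, its weights still sum
to 1, so at least one of the weighted means has to follow it.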

There are some ideas in the literature to overcome the robustness prob- 
lems of Normal mixtures. The software MCLUST [Fraley and Raftery (1998)] 
allows the addition of a mixture component accounting for "noise," modeled
as a uniform distribution on the convex hull (the range in one dimension,
respectively) of the data. The software EMMIX [Peel and McLachlan (2000)]
can be used to fit a mixture of t-distributions instead of Normals.
Further, it has been proposed to estimate the component parameters by 
more robust estimators [Campbell (1984), McLachlan and Basford (1988) 
and Kharin (1996), page 275], in particular, by Huber's (1964, 1981) M- 
estimators corresponding to ML-estimation for a mixture of Huber's least 
favorable distributions [Huber (1964)]. 

While a clear gain of stability can be demonstrated for these methods in 
various examples [see, e.g., Banfield and Raftery (1993) and McLachlan and 
Peel (2000), page 231 ff.], there is a lack of theoretical justification of their 
robustness. Only Kharin [(1996), page 272 ff.] obtained some results for fixed 
s. He showed that under certain assumptions on the speed of convergence of 
the proportion of contamination to 0 with $n \to \infty$, Huber's M-estimation is
asymptotically superior to NMML. In the present paper, mixtures of a class 
of location-scale models are treated including the aforementioned distribu- 
tions. The addition of a "noise" component is also investigated. 

Up to now there is no agreement about adequate robustness measures
for cluster analysis. In a model-based cluster analysis, the clusters are char- 
acterized by the parameters of their mixture components. For fixed s an 
influence function [Hampel (1974)] and a breakdown point [Hampel (1971) 
and Donoho and Huber (1983)] for these parameters can easily be defined. 
The "addition breakdown point" is the minimal proportion of points to be 
added to an original dataset so that the parameter estimator for the new 
dataset deviates as far as possible from the value obtained from the original 
dataset. However, there are some particular issues in cluster analysis. Parti- 
tioning methods may possess a bounded influence function and the minimal 
possible breakdown point at the same time. The breakdown point may de- 
pend strongly on the constellation of the data points [Garcia-Escudero and 
Gordaliza (1999)]. One may distinguish between breakdown of a single clus- 
ter and breakdown of all clusters [Gallegos (2003)], and breakdown could 
be characterized by means of the classification of the points instead of the 
estimated parameters [Kharin (1996), page 49]. The breakdown concepts in 
the literature cited above only apply to a fixed number of components s. If 
s is estimated, there are data constellations "on the border" between two 
different numbers of components, leading to different numbers of parameters 
to estimate. 

The outline of the paper is as follows. In Section 2, the techniques treated 
in this paper and their underlying models are introduced. 

In Section 3, robustness measures and breakdown points in terms of pa- 
rameters (Definition 3.1 for fixed s, Definition 3.2 for estimated s) as well 
as of classification (Definition 3.4) are defined. 

In Section 4, results about the parameter breakdown behavior of the 
mixture-based clustering techniques are derived. It is shown that all dis- 
cussed techniques have a breakdown point of r/(n + r) for r < s of the mix- 
ture components in the case of fixed s (Theorem 4.4). A better breakdown 
behavior can be attained by maximizing a kind of "improper likelihood" 
where "noise" is modeled by an improper uniform distribution on the real 
line (Theorem 4.11). For the case of estimated s, using an information cri- 
terion [Akaike (1974) and Schwarz (1978)], a breakdown point larger than 
l/(n + 1) can be attained for all considered methods. They all are able to 
isolate gross outliers as new mixture components on their own and are there- 
fore very stable against extreme outliers. However, breakdown can happen 
because additional points inside the area of the estimated mixture compo- 
nents of the original data can lead to the estimation of a smaller number of 
components (Theorems 4.13 and 4.16). Some numerical examples are given, 
illustrating the relative stability of the methods and the nonequivalence of 
parameter and classification breakdown and of addition and replacement 
breakdown. Some data constellations turn out to be so stable that they lead 
to an addition parameter breakdown point larger than 1/2. The paper is 
completed by some concluding discussions. 

2. Models and methods. The Normal mixture (1.1) belongs to the class
of mixtures of location-scale families $f_\eta$, which can be defined as follows:

(2.1)  $f_\eta(x) = \sum_{j=1}^{s} \pi_j f_{a_j,\sigma_j}(x),$ where $f_{a,\sigma}(x) = \frac{1}{\sigma} f\left(\frac{x-a}{\sigma}\right),$

where $\eta$ is defined as in (1.1). Assume that

(2.2)  $f$ is symmetrical about 0,

(2.3)  $f$ decreases monotonically on $[0,\infty)$,

(2.4)  $f > 0$ on $\mathbb{R}$,

(2.5)  $f$ is continuous.

Besides the $\mathcal{N}(0,1)$-distribution, these assumptions are fulfilled, for example,
for the $t_\nu$-distribution with $\nu$ degrees of freedom and for Huber's least
favorable distribution, used as a basis for mixture modeling in Peel and 
McLachlan (2000) and McLachlan and Basford (1988), respectively. 

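The following properties will be referred to later. It follows from (2.2)-(2.4)
that, for given points $x_1,\ldots,x_n$ and a compact set $C = [a,b] \times [\mu,\xi] \subset
\mathbb{R} \times \mathbb{R}^+$ (this notation implies $\mu > 0$ here),

(2.6)  $\inf\{f_{a,\sigma}(x) : x \in \{x_1,\ldots,x_n\},\ (a,\sigma) \in C\} = f_{\min} > 0.$

For fixed $x$, $\lim_{m\to\infty} a_m = \infty$ and arbitrary sequences $(\sigma_m)_{m\in\mathbb{N}}$, observe that

(2.7)  $\lim_{m\to\infty} f_{a_m,\sigma_m}(x) \le \lim_{m\to\infty} \min\left(\frac{1}{\sigma_0} f(0),\ \frac{1}{\sigma_m} f\left(\frac{x-a_m}{\sigma_m}\right)\right) = 0$

as long as $\sigma_m \ge \sigma_0 > 0$.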

The addition of a uniform mixture component on the range of the data is 
also considered, which is the one-dimensional case of a suggestion by Banfield 
and Raftery (1993). That is, for given $x_{\min} < x_{\max} \in \mathbb{R}$,

(2.8)  $f_\zeta(x) = \sum_{j=1}^{s} \pi_j f_{a_j,\sigma_j}(x) + \pi_0 \frac{1(x \in [x_{\min}, x_{\max}])}{x_{\max} - x_{\min}},$

where $\zeta = (s, a_1,\ldots,a_s, \sigma_1,\ldots,\sigma_s, \pi_0, \pi_1,\ldots,\pi_s)$, $\pi_0,\ldots,\pi_s \ge 0$, $\sum_{j=0}^{s} \pi_j = 1$
and $1(A)$ is the indicator function for the statement $A$.

The log-likelihood functions for the models (2.1) and (2.8) for given data
$\mathbf{x}_n$, with minimum $x_{\min,n}$ and maximum $x_{\max,n}$ (this notation is also used
later), are

(2.9)  $L_{n,s}(\eta,\mathbf{x}_n) = \sum_{i=1}^{n} \log\left(\sum_{j=1}^{s} \pi_j f_{a_j,\sigma_j}(x_i)\right),$

(2.10)  $L_{n,s}(\zeta,\mathbf{x}_n) = \sum_{i=1}^{n} \log\left(\sum_{j=1}^{s} \pi_j f_{a_j,\sigma_j}(x_i) + \frac{\pi_0}{x_{\max,n} - x_{\min,n}}\right).$
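
For concreteness, the two log-likelihood functions (2.9) and (2.10) can be
written down directly. The following Python sketch is only an illustration:
the function names are chosen here, the standard Normal density is used as
the default for $f$, and no precautions are taken against numerical
underflow for points far away from all components.

```python
import math

def normal_pdf(z):
    """Standard Normal density; one admissible choice of f in (2.1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def component_density(x, a, sigma, f=normal_pdf):
    """Location-scale density f_{a,sigma}(x) = f((x - a) / sigma) / sigma."""
    return f((x - a) / sigma) / sigma

def loglik_mixture(xs, pis, locs, scales, f=normal_pdf):
    """Log-likelihood (2.9) of the mixture model (2.1)."""
    total = 0.0
    for x in xs:
        total += math.log(sum(p * component_density(x, a, s, f)
                              for p, a, s in zip(pis, locs, scales)))
    return total

def loglik_noise(xs, pi0, pis, locs, scales, f=normal_pdf):
    """Log-likelihood (2.10): mixture plus uniform noise on the data range."""
    noise = pi0 / (max(xs) - min(xs))   # requires at least two distinct points
    total = 0.0
    for x in xs:
        total += math.log(noise + sum(p * component_density(x, a, s, f)
                                      for p, a, s in zip(pis, locs, scales)))
    return total
```

Any other density satisfying (2.2)-(2.5) can be passed as `f`.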

As can easily be seen by setting $a_1 = x_1$, $\sigma_1 \to 0$, $L_{n,s}$ can become arbitrarily
large for $s > 1$. Thus, to define a proper ML-estimator, the parameter space
must be suitably restricted. The easiest restriction is to specify $\sigma_0 > 0$ and
to demand

(2.11)  $\sigma_j \ge \sigma_0, \quad j = 1,\ldots,s.$

This is used, for example, in DeSarbo and Cron (1988) and may easily be
implemented with the EM-algorithm [Dempster, Laird and Rubin (1977)
and Redner and Walker (1984); see Lemma 2.1], the most popular routine
to compute mixture ML-estimators. A drawback of this restriction is that
the resulting ML-estimators are no longer scale equivariant because the scale
of the data can be made smaller than $\sigma_0$ by multiplication with a constant.
The alternative restriction

(2.12)  $\min_{j,k=1,\ldots,s} \sigma_j/\sigma_k \ge c$

for fixed $c \in (0,1]$ leads to properly defined, scale-equivariant, consistent
ML-estimators for the Normal case $f = \varphi_{0,1}$ without noise [Hathaway (1985)].
This includes the popular simplification $\sigma_1 = \cdots = \sigma_s$, which corresponds to
k-means clustering and is the one-dimensional case of some of the covariance
parameterizations implemented in MCLUST [Fraley and Raftery (1998)].
However, unless $c = 1$, the computation is not straightforward [Hathaway (1986)].
Furthermore, the restriction (2.12) cannot be applied to the model (2.8),
because the log-likelihood function may be unbounded; see Lemma A.1. For
the case of fixed $s$, Corollary 4.5 says that estimation based on (2.12) does
not yield better breakdown properties than its counterpart using (2.11).
Therefore, the restriction (2.11) is used for all other results. Guidelines for
the choice of $\sigma_0$ and $c$ are given in Section A.1. For results about
consistency of local maximizers of the log-likelihood function, see Redner and
Walker (1984).

The following lemma summarizes some useful properties of the maximum
likelihood estimators, which follow from the derivations of Redner and
Walker (1984).

Notation. Let $\theta_j = (a_j, \sigma_j)$, $j = 1,\ldots,s$, $\theta = (\theta_1,\ldots,\theta_s)$ denote the
location and scale parameters of $\eta$ and $\zeta$, respectively, $\theta^*$, $\eta^*$, $\zeta^*$ by analogy.
The parameters included in $\eta^*$, $\zeta^*$ will be denoted by $s^*$, $a_1^*$, $\pi_1^*$ and so on,
and by analogy for $\hat\eta$, $\hat\zeta$, ....

Lemma 2.1. For given $\eta$, let

(2.13)  $p_{ij} = \frac{\pi_j f_{a_j,\sigma_j}(x_i)}{\sum_{k=1}^{s} \pi_k f_{a_k,\sigma_k}(x_i)}, \quad i = 1,\ldots,n.$

A maximizer $\hat\eta$ of

(2.14)  $\sum_{j=1}^{s} \left[\sum_{i=1}^{n} p_{ij}\right] \log \pi_j^* + \sum_{j=1}^{s} \sum_{i=1}^{n} p_{ij} \log f_{a_j^*,\sigma_j^*}(x_i)$

over $\eta^*$ leads to an improvement of $L_{n,s}$ unless $\eta$ itself attains the maximum
of (2.14).

For given $\zeta$ in (2.10), the same statements hold with

(2.15)  $p_{ij} = \frac{\pi_j f_{a_j,\sigma_j}(x_i)}{\sum_{k=1}^{s} \pi_k f_{a_k,\sigma_k}(x_i) + \pi_0/(x_{\max,n} - x_{\min,n})}, \quad j = 1,\ldots,s,$

        $p_{i0} = \frac{\pi_0/(x_{\max,n} - x_{\min,n})}{\sum_{k=1}^{s} \pi_k f_{a_k,\sigma_k}(x_i) + \pi_0/(x_{\max,n} - x_{\min,n})}.$

In (2.14), the first sum starts at $j = 0$.

For any global maximizer $\eta$ as well as $\zeta$ of $L_{n,s}$ for given $\mathbf{x}_n$, the following
conditions hold under (2.11) for $j = 1,\ldots,s$ with $p_{ij}$, $i = 1,\ldots,n$:

(2.16)  $\pi_j = \frac{1}{n} \sum_{i=1}^{n} p_{ij},$

(2.17)  $(a_j, \sigma_j) = \arg\max S_j(a_j^*, \sigma_j^*) = \arg\max \sum_{i=1}^{n} p_{ij} \log\left(\frac{1}{\sigma_j^*} f\left(\frac{x_i - a_j^*}{\sigma_j^*}\right)\right).$

In case of (2.10), property (2.16) holds for $j = 0$ as well.

Note that (2.13) defines the so-called E-step, and maximization of (2.14) 
defines the so-called M-step of the EM-algorithm, where the two steps are 
alternately carried out. 
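
As an illustration, the two steps can be sketched in Python for the Normal
case $f = \varphi$ under the restriction (2.11). This is a minimal sketch
under stated assumptions: the function name, the fixed number of iterations
and the absence of safeguards against empty components or complete numerical
underflow are choices made here and are not part of the lemma. For Normal
components the M-step is available in closed form, namely the weighted mean
and the weighted standard deviation, the latter truncated from below at
$\sigma_0$.

```python
import math

def em_normal_mixture(xs, pis, locs, scales, sigma0, n_iter=200):
    """EM iterations for a 1-D Normal mixture with fixed s and sigma_j >= sigma0."""
    n, s = len(xs), len(pis)
    pis, locs, scales = list(pis), list(locs), list(scales)
    for _ in range(n_iter):
        # E-step, cf. (2.13): posterior probabilities p[i][j]
        p = []
        for x in xs:
            dens = [pis[j] * math.exp(-0.5 * ((x - locs[j]) / scales[j]) ** 2)
                    / (scales[j] * math.sqrt(2.0 * math.pi)) for j in range(s)]
            tot = sum(dens)
            p.append([d / tot for d in dens])
        # M-step: maximize (2.14) over the new parameters
        for j in range(s):
            w = sum(p[i][j] for i in range(n))
            pis[j] = w / n                                   # cf. (2.16)
            locs[j] = sum(p[i][j] * xs[i] for i in range(n)) / w
            var = sum(p[i][j] * (xs[i] - locs[j]) ** 2 for i in range(n)) / w
            scales[j] = max(math.sqrt(var), sigma0)          # enforce (2.11)
    return pis, locs, scales

# Hypothetical toy data with two well-separated groups.
toy = [-0.1, 0.0, 0.1, 4.9, 5.0, 5.1]
fit = em_normal_mixture(toy, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0], sigma0=0.025)
```

The iteration only finds a local maximizer depending on the starting values,
whereas the results of the present paper refer to global maximizers.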

Lemma 2.2. Under (2.11), with

$C = [x_{\min,n}, x_{\max,n}] \times \left[\sigma_0,\ \frac{\sigma_0 f(0)}{f((x_{\max,n} - x_{\min,n})/\sigma_0)}\right]$

and $\pi_1,\ldots,\pi_s > 0$,

(2.18)  $\forall \theta^* \notin C^s\ \exists \theta \in C^s : L_{n,s}(\eta) > L_{n,s}(\eta^*).$

Proofs are given in Section A.2.

Note that $L_{n,s}$ is continuous [cf. (2.5)] and a global maximizer has to lie
in $C^s \times [0,1]^s$ because of (2.18). Therefore, we have the following result.

Corollary 2.3. Under the restriction (2.11), there exists a (not
necessarily unique) global maximum of $L_{n,s}$ with arguments in $C^s \times [0,1]^s$.

For NMML and (2.12), this is shown by Hathaway (1985). Define $\eta_{n,s} =
\arg\max L_{n,s}$ and $\zeta_{n,s}$ analogously. In the case of nonuniqueness, $\eta_{n,s}$ can
be defined as an arbitrary maximizer, for example, the lexicographically
smallest one. The $p_{ij}$-values from (2.13) and (2.15), respectively, can be
interpreted as the a posteriori probabilities that a point $x_i$ had been
generated by component $j$ under the a priori probability $\pi_j$ for component $j$
with parameters $a_j$, $\sigma_j$. These values can be used to classify the points and
to generate a clustering by

(2.19)  $l(x_i) = \arg\max_j p_{ij}, \quad i = 1,\ldots,n,$

where the ML-estimator is plugged into the definition of $p_{ij}$.

All theorems derived in the present paper will hold for any of the
maximizers. For ease of notation, $\eta_{n,s}$ and $\zeta_{n,s}$ will be treated as well defined
in the following. Note that, for $s > 1$, nonuniqueness always occurs due to
"label switching" of the mixture components. Further, for ease of notation,
it is not assumed, in general, that $\pi_j > 0\ \forall j$ or that all $(a_j, \sigma_j)$ are pairwise
distinct.

Consider now the number of mixture components $s \in \mathbb{N}$ as unknown. The
most popular method to estimate $s$ is the use of information-based criteria
such as AIC [Akaike (1974)] and BIC [Schwarz (1978)]. The latter is
implemented in MCLUST. EMMIX computes both. The estimator $s_n$ for the
correct order of the model is defined as $s_n = \arg\max_s C(s)$, where

(2.20)  $C(s) = \mathrm{AIC}(s) = 2L_{n,s}(\eta_{n,s}) - 2k$ or
        $C(s) = \mathrm{BIC}(s) = 2L_{n,s}(\eta_{n,s}) - k \log n,$

where $k$ denotes the number of free parameters, that is, $k = 3s - 1$ for (2.1)
and $k = 3s$ for (2.8). Under assumptions satisfied under (2.11) but not under
(2.12) for the models discussed here (compare Lemma A.1), Lindsay [(1995),
page 22] shows that the number of distinct points in the dataset is an upper
bound for the maximization of $L_{n,s}(\eta_{n,s})$ over $s$, and therefore for the
maximization of $C(s)$ as well. Thus, only a finite number of values for $s$ have to
be investigated to maximize $C(s)$ and this means that (again not necessarily
unique) maximizers exist.
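
The criteria in (2.20) are simple to evaluate once the maximized
log-likelihoods are available. The following Python sketch assumes that
these values have already been computed for a finite range of $s$; the
function names and the numbers in the example are hypothetical and serve
only to show that AIC and BIC may select different orders.

```python
import math

def n_free_parameters(s, noise=False):
    """k = 3s - 1 for model (2.1) and k = 3s for the noise model (2.8)."""
    return 3 * s if noise else 3 * s - 1

def aic(loglik, s, noise=False):
    """AIC(s) = 2 L - 2k, to be maximized."""
    return 2.0 * loglik - 2.0 * n_free_parameters(s, noise)

def bic(loglik, s, n, noise=False):
    """BIC(s) = 2 L - k log n, to be maximized."""
    return 2.0 * loglik - n_free_parameters(s, noise) * math.log(n)

def select_order(logliks, n, criterion="bic", noise=False):
    """Return the s maximizing C(s); logliks maps s to the maximized log-likelihood.

    Ties are broken in favor of the smallest s.
    """
    def score(s):
        if criterion == "aic":
            return aic(logliks[s], s, noise)
        return bic(logliks[s], s, n, noise)
    return max(sorted(logliks), key=score)

# Hypothetical maximized log-likelihoods for s = 1, 2, 3 and n = 50 points.
example = {1: -120.0, 2: -100.0, 3: -96.0}
s_bic = select_order(example, n=50, criterion="bic")
s_aic = select_order(example, n=50, criterion="aic")
```

With these hypothetical values the BIC selects two components and the AIC
three, since the BIC penalty per parameter, $\log 50$, exceeds 2.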

While the AIC is known to overestimate s asymptotically [see, e.g., Bozdogan 
(1994)], the BIC is shown at least in some restricted situations to be con- 
sistent in the mixture setup [Keribin (2000)]. I mainly consider the BIC 
here. Further suggestions to estimate s, which are more difficult to analyze 
with respect to the breakdown properties, are given, for example, by Boz- 
dogan (1994) and Celeux and Soromenho (1996). EMMIX also allows the 
estimation of s via a bootstrapped likelihood ratio test [McLachlan (1987)]. 

3. Breakdown measures for cluster analysis. The classical meaning of
"addition breakdown" for finite samples is that an estimator can be driven 
arbitrarily far away from its original value by addition of unfortunate data 
points, usually by gross outliers. For "replacement breakdown points" , points 
from the original sample are replaced [Donoho and Huber (1983)]. Zhang and 
Li (1998) and Zuo (2001) derive relations between these two concepts. In 
the present paper, addition breakdown is considered. Breakdown means that 
estimators that can take values on the whole range of $\mathbb{R}^p$ can leave every
compact set. If the range of values of a parameter is bounded, breakdown 
means that the addition of points can take the estimator arbitrarily close to 
the bound, for example, a scale parameter to 0. Such a definition is relatively 
easily applied to the estimation of mixture components, but it cannot be 
used to compare the robustness of mixture estimators with other methods 
of cluster analysis. 

Therefore, the more familiar parameter breakdown point will be defined 
first. Then, a breakdown definition in terms of the classification of points to 
clusters is proposed. 

A "parameter breakdown" can be understood in two ways. A situation 
where at least one of the mixture components explodes is defined as break- 
down in Garcia-Escudero and Gordaliza (1999). That is, breakdown occurs 
if the whole parameter vector leaves all compact sets [not including scales
of 0 under (2.12)]. In contrast, Gallegos (2003) defines breakdown in cluster
analysis as a situation where all clusters explode simultaneously. Intermedi- 
ate situations may be of interest in practice, especially if a researcher tries 
to prevent the breakdown of a single cluster by specifying the number of 
clusters to be larger than expected, so that additional clusters can catch the 
outliers. This is discussed (but not recommended — in agreement with the 
results given here) by Peel and McLachlan (2000). The definition given next 
is flexible enough to account for all mentioned situations. 

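Definition 3.1. Let $(E_n)_{n\in\mathbb{N}}$ be a sequence of estimators of $\eta$ in model (2.1)
or of $\zeta$ in model (2.8) on $\mathbb{R}^n$ for fixed $s \in \mathbb{N}$. Let $r \le s$, $\mathbf{x}_n = (x_1,\ldots,x_n)$ be
a dataset, where

(3.1)  $\hat\eta = E_n(\mathbf{x}_n) \Rightarrow \hat\pi_j > 0, \quad j = 1,\ldots,s.$

The r-components breakdown point of $E_n$ is defined as

$B_{r,n}(E_n, \mathbf{x}_n) = \min_g \Big\{ \frac{g}{n+g} : \exists j_1 < \cdots < j_r\ \forall D = [\pi_{\min}, 1] \times C,\ \pi_{\min} > 0,$
    $C \subset \mathbb{R} \times \mathbb{R}^+$ compact $\exists \mathbf{x}_{n+g} = (x_1,\ldots,x_{n+g}),$
    $\hat\eta = E_{n+g}(\mathbf{x}_{n+g}) : (\hat\pi_j, \hat a_j, \hat\sigma_j) \notin D,\ j = j_1,\ldots,j_r \Big\}.$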

The proportions $\pi_j$ are defined not to break down if they are bounded
away from 0, which implies that they are bounded away from 1 if $s > 1$.
Assumption (3.1) is necessary for the definition to make sense; $\pi_j = 0$ would
imply that the corresponding location and scale parameters could be chosen
arbitrarily large without adding any point. Condition (3.1) may be violated
for ML-estimators in situations where $s$ is not much smaller than $n$, but
these situations are usually not of interest in cluster analysis. In particular,
(3.1) does not hold if $s$ exceeds the number of distinct points; see Lindsay
[(1995), page 23].

The situation of $\pi_0 \to 0$ in model (2.8) is not defined as breakdown,
because the noise component is not considered as an object of interest in itself
in this setup.

In the case of unknown $s$, considerations are restricted to the case of
one-component breakdown. Breakdown robustness means that neither of the $s$
mixture components estimated for $\mathbf{x}_n$ vanishes, nor that any of their scale
and location parameters explodes to $\infty$ under addition of points. It is,
however, allowed that the new dataset yields more than $s$ mixture components
and that the additional mixture components have arbitrary parameters. This
implies that, if the outliers form a cluster on their own, their component can
simply be added without breakdown. Further, breakdown of the proportions
$\pi_j$ to 0 is no longer of interest when estimating $s$ according to the AIC or
BIC, because if some $\pi_j$ is small enough, component $j$ can be simply left
out, and the other proportions can be updated to sum up to 1. This solution
with $s - 1$ clusters leads approximately to the same log-likelihood and will
be preferred due to the penalty on the number of components:

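Definition 3.2. Let $(E_n)_{n\in\mathbb{N}}$ be a sequence of estimators of $\eta$ in model (2.1)
or of $\zeta$ in model (2.8) on $\mathbb{R}^n$, where $s \in \mathbb{N}$ is estimated as well. Let $\mathbf{x}_n =
(x_1,\ldots,x_n)$ be a dataset. Let $s^*$ be the estimated number of components of
$E_n(\mathbf{x}_n)$. The parameter breakdown point of $E_n$ is defined as

$B_n(E_n, \mathbf{x}_n) = \min_g \Big\{ \frac{g}{n+g} : \forall C \subset \mathbb{R}^{s^*} \times (\mathbb{R}^+)^{s^*}$ compact
    $\exists \mathbf{x}_{n+g} = (x_1,\ldots,x_{n+g}),\ \hat\eta = E_{n+g}(\mathbf{x}_{n+g}) :$
    pairwise distinct $j_1,\ldots,j_{s^*}$ do not exist,
    such that $(\hat a_{j_1},\ldots,\hat a_{j_{s^*}}, \hat\sigma_{j_1},\ldots,\hat\sigma_{j_{s^*}}) \in C \Big\}.$

This implies especially that breakdown occurs whenever $\hat s < s^*$, where $\hat s$
denotes the number of components estimated for $\mathbf{x}_{n+g}$.

Now, the classification breakdown is defined. A mapping $E_n$ is called a
general clustering method (GCM) if it maps a set of entities $\mathbf{x}_n = \{x_1,\ldots,x_n\}$
to a collection of subsets $\{C_1,\ldots,C_s\}$ of $\mathbf{x}_n$. A special case are partitioning
methods where $C_i \cap C_j = \emptyset$ for $i \ne j \le s$, $\bigcup_{j=1}^{s} C_j = \mathbf{x}_n$. An ML-mixture
estimator induces a partition by (2.19) and $C_j = \{x_i : l(x_i) = j\}$, given a rule
to break ties in the $p_{ij}$.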

If $E_n$ is a GCM and $\mathbf{x}_{n+g}$ is generated by adding $g$ points to $\mathbf{x}_n$, $E_{n+g}(\mathbf{x}_{n+g})$
induces a clustering on $\mathbf{x}_n$, which is denoted by $E_n^*(\mathbf{x}_{n+g})$. Its clusters are
denoted by $C_1^*,\ldots,C_{s^*}^*$. If $E_n$ is a partitioning method, $E_n^*(\mathbf{x}_{n+g})$ is a
partition as well. Note that $s^*$ may be smaller than $s$ when $E_n$ produces $s$ clusters
for all $n$. Assume in the following that $E_n$ is a partitioning method. The
resulting definition may be tentatively applied to other clustering methods as
well.

As will be illustrated in Remark 4.18, different clusters of the same data
may have a different stability. Thus, it makes sense to define robustness with
respect to the individual clusters. This requires a measure for the similarity
between a cluster of $E_n^*(\mathbf{x}_{n+g})$ and a cluster of $E_n(\mathbf{x}_n)$, that is, between two
subsets $C$ and $D$ of some finite set. The following proposal equals 0 only for
disjoint sets and 1 only for equal sets:

$\gamma(C, D) = \frac{2|C \cap D|}{|C| + |D|}.$

The definition of (addition) breakdown is based on the similarity of a cluster
$C \in E_n(\mathbf{x}_n)$ to its most similar cluster in $E_n^*(\mathbf{x}_{n+g})$. A similarity between $C$
and a partition $\hat E_n(\mathbf{x}_n)$ is defined by

$\gamma^*(C, \hat E_n(\mathbf{x}_n)) = \max_{D \in \hat E_n(\mathbf{x}_n)} \gamma(C, D).$

How small should $\gamma^*$ be to say that breakdown of $C$ has occurred? The usual
choice in robust statistics would be the worst possible value. In the present
setup, this value depends on the dataset and on the clustering method. For
example, if $n = 12$ and $|C| = 6$, the minimum for $\gamma^*(C, E_n^*(\mathbf{x}_{n+g}))$ is 1/4. It
is attained by building $s^* = 6$ clusters with two points each, one of which
is in $C$, so that $\gamma = 2 \cdot 1/(6 + 2)$ for every cluster. But $s < 6$ may be fixed,
and this would result in a larger minimum. Even under estimated $s$ the minimum
may be larger. For example, if the points lie on the real line and the
clustering method produces only connected clusters, we get
$\gamma^*(C, E_n^*(\mathbf{x}_{n+g})) \ge 2/7$. In general, the worst possible
value may be difficult to compute and sometimes only attainable by tricky
combinatorics, while one would judge a cluster as "broken down" already in
much simpler constellations of $E_n^*(\mathbf{x}_{n+g})$. I propose

(3.2)  $\gamma^* \le \frac{2}{3} = \gamma(\{x,y\},\{x\}) = \gamma(C, C_1)$ if $C_1 \subset C$, $|C_1| = |C|/2$,

as the breakdown condition motivated by the following lemma, which means
that under this condition every cluster can break down, at least in the
absence of further subtle restrictions on the possible clusterings.
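
The similarity measure and the breakdown condition (3.2) are easy to
evaluate for given clusterings. The following Python sketch is only an
illustration; the function names are chosen here, clusters are represented
as sets of point labels, and the similarity of a cluster to a clustering is
taken as the value for its most similar cluster, that is, the largest
pairwise similarity. Both sets passed to `gamma` are assumed to be nonempty.

```python
def gamma(c, d):
    """Similarity 2|C n D| / (|C| + |D|) of two nonempty finite sets."""
    c, d = set(c), set(d)
    return 2.0 * len(c & d) / (len(c) + len(d))

def gamma_star(c, clustering):
    """Similarity of C to its most similar cluster in the clustering."""
    return max(gamma(c, d) for d in clustering)

def broken_down(c, clustering):
    """Breakdown condition (3.2): gamma* <= 2/3."""
    return gamma_star(c, clustering) <= 2.0 / 3.0

# Example from the text: n = 12, |C| = 6, six clusters of two points each,
# one of which lies in C.
cluster = set(range(6))
pairs = [{i, i + 6} for i in range(6)]
worst = gamma_star(cluster, pairs)
```

For this constellation every pairwise similarity equals $2 \cdot 1/(6+2)$,
so `worst` is 1/4, in agreement with the value stated above.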

Lemma 3.3. Let $E_n(\mathbf{x}_n) \ni C$ be a partition with $|E_n(\mathbf{x}_n)| \ge 2$. Let $S \subseteq \mathbb{N}$
be the set of possible cluster numbers containing at least one element $s \ge 2$.
Let $\mathcal{F} = \{F$ partition on $\mathbf{x}_n : |F| \in S\}$. Then $\exists \hat F \in \mathcal{F} : \gamma^*(C, \hat F) \le 2/3$, where
2/3 is the smallest value for this to hold.

Definition 3.4. Let $(E_n)_{n\in\mathbb{N}}$ be a sequence of GCM's. The classification
breakdown point of a cluster $C \in E_n(\mathbf{x}_n)$ is defined as

$B_n^c(E_n, \mathbf{x}_n, C) = \min_g \Big\{ \frac{g}{n+g} : \exists \mathbf{x}_{n+g} = (x_1,\ldots,x_{n+g}) : \gamma^*(C, E_n^*(\mathbf{x}_{n+g})) \le \frac{2}{3} \Big\}.$

The r-clusters classification breakdown point of $E_n$ at $\mathbf{x}_n$ is

$B_{r,n}^c(E_n, \mathbf{x}_n) = \min_g \Big\{ \frac{g}{n+g} : \exists \mathbf{x}_{n+g} = (x_1,\ldots,x_{n+g}),\ C_1,\ldots,C_r \in E_n(\mathbf{x}_n)$
    pairwise distinct $: \gamma^*(C_i, E_n^*(\mathbf{x}_{n+g})) \le \frac{2}{3},\ i = 1,\ldots,r \Big\}.$

Remark 3.5. At least $r \ge 1$ clusters of $E_n(\mathbf{x}_n)$ have to break down
if $|E_n^*(\mathbf{x}_{n+g})| = s - r$. For $r = 1$, the reason is that there must be $D \in
E_n^*(\mathbf{x}_{n+g})$ such that there are at least two members of $E_n(\mathbf{x}_n)$, $C_1$ and $C_2$,
say, for which $D$ maximizes $\gamma(C_j, D)$ over $E_n^*(\mathbf{x}_{n+g})$. Without loss of
generality, $|C_1 \cap D| \le |C_2 \cap D|$. Since $C_1 \cap C_2 = \emptyset$, we get
$\gamma(C_1, D) \le |D|/(|D|/2 + |D|) = 2/3$.
The same argument yields for $r > 1$ that $\gamma(C_j, D) \le 2/3$ for at least $q - 1$
clusters $C_j$ if $D \in E_n^*(\mathbf{x}_{n+g})$ is the most similar cluster for $q$ clusters $C_j \in
E_n(\mathbf{x}_n)$.



Note that parameter breakdown does not imply classification breakdown 
and vice versa (cf. Remarks 4.10 and 4.18). 



4. Breakdown results. 



4.1. Breakdown points for fixed s. This section starts with three lemmas
which characterize the behavior of the estimators for a sequence of datasets
where there are $s \ge h \ge 2$ groups of points in every dataset, each group
having a fixed range, but with the distances between the groups converging to
$\infty$. In this case, eventually there exists a mixture component corresponding
to each group, all mixture components correspond to one of the groups
and the maximum of the log-likelihood can be obtained from the maxima 
considering the groups alone; that is, all groups are fitted separately. 

Lemma 4.1. Let $\mathbf{x}_{nm} = (x_{1m},\ldots,x_{nm}) \in \mathbb{R}^n$ be a sequence of datasets
with $m \in \mathbb{N}$ and $0 = n_0 < n_1 < \cdots < n_h = n$, $h \ge 1$. Let $D_1 = \{1,\ldots,n_1\}$,
$D_2 = \{n_1 + 1,\ldots,n_2\}$, ..., $D_h = \{n_{h-1} + 1,\ldots,n_h\}$. Assume further that

$\exists b < \infty : \max_k \max_{i,j \in D_k} |x_{im} - x_{jm}| \le b \quad \forall m,$

$\lim_{m\to\infty} \min_{k \ne l,\ i \in D_k,\ j \in D_l} |x_{im} - x_{jm}| = \infty.$

Let $s \ge h$ be fixed, $\eta_m = \arg\max_\eta L_{n,s}(\eta, \mathbf{x}_{nm})$. The parameters of $\eta_m$ are
called $\pi_{1m},\ldots,\pi_{sm}$, $a_{1m}$ and so on; all results hold for $\zeta_m$ from
maximizing (2.10) as well. Without loss of generality, assume $x_{1m} \le x_{2m} \le \cdots \le
x_{nm}$. Then, for $m_0 \in \mathbb{N}$ large enough,

(4.1)  $\exists\, 0 \le d < \infty,\ \pi_{\min} > 0,\ \sigma_0 \le \sigma_{\max} < \infty : \forall m > m_0,$
       $k = 1,\ldots,h\ \exists j_k \in \{1,\ldots,s\} :$
       $a_{j_k m} \in [x_{(n_{k-1}+1)m} - d,\ x_{n_k m} + d],\ \pi_{j_k m} \ge \pi_{\min},\ \sigma_{j_k m} \in [\sigma_0, \sigma_{\max}].$

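Lemma 4.2. In the situation of Lemma 4.1, assume further

(4.2)  $\exists \pi_{\min} > 0 : \forall j = 1,\ldots,s,\ m \in \mathbb{N} : \pi_{jm} \ge \pi_{\min}.$

Then

(4.3)  $\forall m > m_0,\ j = 1,\ldots,s\ \exists k \in \{1,\ldots,h\} : a_{jm} \in [x_{(n_{k-1}+1)m} - d,\ x_{n_k m} + d],$

(4.4)  $\exists \sigma_0 \le \sigma_{\max} < \infty : \forall m > m_0,\ j = 1,\ldots,s : \sigma_{jm} \in [\sigma_0, \sigma_{\max}].$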



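Lemma 4.3. Under the assumptions of Lemma 4.1,

(4.5)  $\forall k \in \{1,\ldots,h\} : \lim_{m\to\infty} \sum_{j :\, a_{jm} \in [x_{(n_{k-1}+1)m} - d,\ x_{n_k m} + d]} \pi_{jm} = \frac{|D_k|}{n},$

(4.6)  $\lim_{m\to\infty} \left| L_{n,s}(\eta_m, \mathbf{x}_{nm}) - \max_{\sum_{k=1}^{h} q_k = s} \sum_{k=1}^{h} \left[ \max_\eta L_{|D_k|,q_k}(\eta, \mathbf{y}_{km}) + |D_k| \log \frac{|D_k|}{n} \right] \right| = 0,$

where $\mathbf{y}_{km} = (x_{(n_{k-1}+1)m},\ldots,x_{n_k m})$, $k = 1,\ldots,h$.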

In particular, $r < s$ added outliers let $r$ mixture components break down
if the differences between them tend to $\infty$.

Theorem 4.4. Let $\mathbf{x}_n \in \mathbb{R}^n$, $s > 1$. Let $\eta_{n,s}$ be a global maximizer of (2.9).
Assume (2.2)-(2.5). For $r = 1,\ldots,s - 1$,

(4.7)  $B_{r,n}(\eta_{n,s}, \mathbf{x}_n) \le \frac{r}{n+r}.$

Equality in (4.7) could be proven for datasets where $\pi_j \to 0$ can be
prevented for $j = 1,\ldots,s$ and any sequence of sets of $r$ added points, but
conditions for this are hard to derive.

Under the restriction (2.12), convergence of $\sigma_j$-parameters to 0 implies
breakdown according to Definition 3.1. Thus, to prevent breakdown, an
effective lower bound for the $\sigma_j$ of the nonbreaking components has to exist.
This means that all $\sigma_j$ have to be bounded from below, independently of
$x_{n+1},\ldots,x_{n+r}$, because (2.12) forces all $\sigma_j$ to 0 if only one implodes.
Therefore, the result carries over.

Corollary 4.5. Theorem 4.4 holds as well under the restriction (2.12) 
instead of (2.11). 

Remark 4.6. The situation for $r = s$ is a bit more complicated, because
here the choice of the basic distribution $f$ matters. Assume that the
proportion of outliers in the distorted dataset is smaller than 1/2. While $s - 1$
mixture components can be broken down by $s$ outliers, there remains at
least one mixture component for which the original points own a majority
of the weights used for the estimation of the parameters. If the parameters
of such a component are estimated by a nonrobust ML-estimator such as
the Normal one, the $s$th component will break down as well, that is, under
$f = \varphi$,

$B_{s,n}(\eta_{n,s}, \mathbf{x}_n) \le \frac{s}{n+s}.$

The breakdown point for the joint ML-estimator of location and scale for
a single location-scale model based on the $t_\nu$-distribution was shown to be
greater than or equal to $1/(\nu + 1)$ by Tyler (1994), ignoring the possible
breakdown of the scale to 0, which is prevented here because of (2.11). Suppose
that points are added so that their proportion is smaller than $1/(\nu + 1)$.
Mixture ML-estimation with $s$ components leads to the existence of at least
one component such that the parameters are estimated by a weighted
$t_\nu$-likelihood according to (2.17) with weight proportion smaller than $1/(\nu + 1)$
for the added points. Thus,

$B_{s,n}(\eta_{n,s}, \mathbf{x}_n) \ge \frac{1}{\nu+1},$

where $f(x) = q(1 + x^2/\nu)^{-(\nu+1)/2}$, $\nu \ge 1$, $q > 0$ being the norming constant.

The approach via adding a noise component does not lead to a better 
breakdown behavior, because a single outlier can make the density value of 
the noise component arbitrarily small. 
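
This effect is elementary and can be made explicit with a few lines of
Python. The sketch below is illustrative only: the function name, the data
and the noise proportion are hypothetical choices. It evaluates the density
value $\pi_0/(x_{\max,n} - x_{\min,n})$ of the noise component in (2.8) when
a single point is added at increasing distance.

```python
def noise_density(xs, pi0):
    """Density value pi0 / (x_max - x_min) of the uniform noise component."""
    return pi0 / (max(xs) - min(xs))

base = [0.0, 1.0, 4.0, 5.0]                 # hypothetical original data
values = {outlier: noise_density(base + [outlier], pi0=0.1)
          for outlier in (10.0, 1e3, 1e6)}  # one added point at a time
```

Since the range of the data grows linearly with the outlier, the density
value that the noise component can assign to any point decreases to 0, and
the noise component cannot absorb the outlier at a bounded cost in
likelihood.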

Corollary 4.7. Theorem 4.4 and Remark 4.6 hold as well for global
maximizers of (2.10). 

Example 4.8. While the breakdown point for all considered approaches 
is the same for r < s, it may be of interest to determine how large an outlier 
has to be to cause breakdown of the methods. The following definition is 
used to generate reproducible datasets. 

Definition 4.9. $\Phi^{-1}_{a,\sigma^2}(1/(n+1)),\ldots,\Phi^{-1}_{a,\sigma^2}(n/(n+1))$ is called an $(a,\sigma^2)$-
Normal standard dataset (NSD) with $n$ points, where $\Phi_{a,\sigma^2}$ denotes the c.d.f.
of the Normal distribution with parameters $a$, $\sigma^2$.
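
Such datasets are straightforward to generate. A minimal Python sketch,
using `statistics.NormalDist` from the standard library, is given below;
the function name `normal_standard_dataset` is chosen here for illustration
and is not part of the definition.

```python
from statistics import NormalDist

def normal_standard_dataset(a, sigma2, n):
    """Return the (a, sigma2)-NSD with n points: quantiles at i/(n+1), i = 1..n."""
    dist = NormalDist(mu=a, sigma=sigma2 ** 0.5)
    return [dist.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

# A (0,1)-NSD with 25 points combined with a (5,1)-NSD with 25 points.
data = normal_standard_dataset(0.0, 1.0, 25) + normal_standard_dataset(5.0, 1.0, 25)
```

Because the points are fixed quantiles, no random number generator is
involved and the dataset is exactly reproducible.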

Consider a dataset of 50 points, consisting of a (0,1)-NSD with 25 points
combined with a (5,1)-NSD with 25 points (see Figure 1) and $s = 2$. For
Normal mixtures, $t_\nu$-mixtures with $\nu \ge 1$ and Normal mixtures with noise
component, the ML-estimators always result in components corresponding
almost exactly to the two NSD's under $\sigma_0 = 0.025$ (see Section A.1). How
large does an additional outlier have to be chosen so that the 50 original
points fall into one single cluster and the second mixture component fits
only the outlier? For Normal mixtures, breakdown begins with an additional
point at about 15.2. For a mixture of $t_3$-distributions the outlier has to lie
at about 800, $t_1$-mixtures need the outlier at about $3.8 \times 10^6$ and a Normal
mixture with an additional noise component breaks down with an additional
point at $3.5 \times 10^7$. These values depend on $\sigma_0$.

Remark 4.10. Theorem 4.4 and Corollary 4.7 carry over to the
classification breakdown point. This follows because if $r$ outliers are added,
tending to $\infty$ and with the distance between them converging to $\infty$ as well,
Lemma 4.2 yields that $p_{ij} \to 0$ for the original points $i = 1,\ldots,n$ and $j$
satisfying $a_{jm} \in [x_{n+g} - d,\ x_{n+g} + d]$ for some $g \in \{1,\ldots,r\}$. Thus, at most
$s - r$ clusters remain for the classification of the original points, which yields
breakdown of $r$ clusters; compare Remark 3.5. In contrast, the arguments
leading to Remark 4.6 (Normal case) do not carry over because the addition
of $r = s$ outliers as above certainly causes all mean parameters to explode,
but one cluster usually remains containing all the original points. Therefore,
an original cluster containing more than half of the points does not break
down in the sense of classification.

Fig. 1. Above: "Standard" example dataset: 25 points (0,1)-NSD combined with 25
points (5,1)-NSD. Below: Stars denote 13 additional equidistant points between 1.8 and
3.2.

4.2. Alternatives for fixed s. The results given above indicate that the 
considered mixture methods are generally not breakdown robust for fixed 
s. A first proposal for the construction of estimators with better break- 
down behavior is based on the optimization of a target function for only 
a part of the data, say, optimally selected 50% or 80% of the points. The 
methods of trimmed k-means [Garcia-Escudero and Gordaliza (1999)] and
clustering based on minimum covariance determinant estimators [Rocke and 
Woodruff (2000) and Gallegos (2003)] use this principle. Both methods, how- 
ever, assume a partition model as opposed to the mixture model. Such an 
assumption may be useful for clustering, but yields biased parameter esti- 
mators [Bryant and Williamson (1986)]. Weighted likelihood as proposed by 
Markatou (2000) might be an alternative for the mixture model. One of the 
estimators treated in the previous section might be used after removing out- 
liers by the nearest neighbor clutter removal procedure (NNC) of Byers and 
Raftery (1998). However, this procedure is based on mixture estimation as 
well (though not of location-scale type), and arguments analogous to those 
given above will lead to similar breakdown properties. As a simple example, 
consider a dataset consisting of a (0, 1)-NSD with 25 points, a (5, 1)-NSD 
with 25 points and an outlier at 50. The outlier at 50 is classified as "clutter" 
by NNC, but if another outlier as huge as $10^{100}$ is added, NNC classifies 50
as a nonoutlier. 

Another alternative can be constructed by modifying the uniform noise 
approach. The problem of this approach is that the noise component could 
be affected by outliers as well, as was shown in the previous section. This 
can be prevented by choosing the density constant for the noise component 
as fixed in advance, leading to ML-estimation for a mixture where some 
improper distribution component is added to model the noise. That is, an 
estimator $\xi_{n,s}$ of the mean and variance parameters of the nonnoisy mixture
components and of all the proportions is defined as the maximizer of

$$L_{n,s}(\xi, x_n) = \sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{s} \pi_j f_{a_j,\sigma_j}(x_i) + \pi_0 b\Biggr), \tag{4.8}$$

where b > 0. The choice of b is discussed in Section A.1. For $\xi_{n,s}$, the
breakdown point depends on the dataset $x_n$. Breakdown can only occur if
additional observations allow the nonoutliers to be fitted by fewer than s
components, and this means that a relatively good solution for r < s components
must exist even for $x_n$. This is formalized in the following theorem, where
only the breakdown of a single mixture component, $B_{1,n}(\xi_{n,s}, x_n)$, is
considered.
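The objective with a fixed improper noise density can be sketched in Python as follows. This is a minimal illustration under stated assumptions: Normal components ($f = \varphi$), an objective of the form $\sum_i \log(\sum_j \pi_j f_{a_j,\sigma_j}(x_i) + \pi_0 b)$, and a hypothetical function name `improper_noise_loglik`. It only evaluates the objective for given parameters and does not maximize it.

```python
import math
from statistics import NormalDist


def improper_noise_loglik(data, means, sds, props, noise_prop, b):
    """Evaluate sum_i log( sum_j props[j] * N(x_i; means[j], sds[j]) + noise_prop * b )."""
    if abs(sum(props) + noise_prop - 1.0) > 1e-9:
        raise ValueError("proportions (including the noise proportion) must sum to 1")
    comps = [NormalDist(m, s) for m, s in zip(means, sds)]
    total = 0.0
    for x in data:
        density = sum(p * c.pdf(x) for p, c in zip(props, comps)) + noise_prop * b
        total += math.log(density)
    return total
```

With `noise_prop > 0`, an arbitrarily distant point contributes at least `log(noise_prop * b)`, a value that does not depend on how far away the point lies; this is the property that distinguishes the fixed constant b from a noise density estimated from the data range.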



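Theorem 4.11. Let $L_{n,s} = L_{n,s}(\xi_{n,s}, x_n)$, $x_n \in \mathbb{R}^n$. Let $\xi = \xi_{n,s}$ and
$f_{\max} = f(0)/\sigma_0 > b$. If

$$\max_{r<s} L_{n,r} < \sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{s} \pi_j f_{\theta_j}(x_i) + \Bigl(\pi_0 + \frac{g}{n}\Bigr) b\Biggr) + g \log\Bigl(\Bigl(\pi_0 + \frac{g}{n}\Bigr) b\Bigr) + (n+g)\log\frac{n}{n+g} - g\log f_{\max}, \tag{4.9}$$

then

$$B_{1,n}(\xi_{n,s}, x_n) > \frac{g}{n+g}. \tag{4.10}$$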

Example 4.12. Consider the dataset of 50 points shown in Figure 1,
$f = \varphi$, b = 0.0117 and $\sigma_0 = 0.025$ (cf. Section A.1). This results in
$L_{n,1} = -119.7$. Neither the optimal solution for s = 1 nor the one for s = 2
classifies any point as noise. The right-hand side of (4.9) equals -111.7 for
g = 1 and -122.4 for g = 2, so that (4.9) holds for g = 1 but not for g = 2.
Thus, the breakdown point is greater than 1/51. Empirically, the addition of
three extreme outliers at value 50, say, leads to a breakdown, namely to the
classification of one of the two original components as noise and to the
interpretation of the outliers as the second normal component. Two outliers
do not suffice. Equation (4.10) is somewhat conservative. This stems from the
exclusion of the breakdown of a proportion parameter to 0, which is
irrelevant for this example.
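For readers who want to evaluate the bound themselves, the following Python sketch computes a right-hand side of the form $\sum_i \log(\sum_j \pi_j f_{\theta_j}(x_i) + (\pi_0 + g/n) b) + g \log((\pi_0 + g/n) b) + (n+g)\log(n/(n+g)) - g \log f_{\max}$ for Normal components. The function name `rhs_of_4_9` and its argument names are hypothetical; the fitted parameters must be supplied by the user (for instance from an EM run), since no estimation is done here.

```python
import math
from statistics import NormalDist


def rhs_of_4_9(data, means, sds, props, noise_prop, b, g, sigma0):
    """Right-hand side of the bound for Normal components and g added points."""
    n = len(data)
    f_max = NormalDist().pdf(0.0) / sigma0        # f(0) / sigma_0
    shifted_noise = (noise_prop + g / n) * b      # (pi_0 + g/n) * b
    comps = [NormalDist(m, s) for m, s in zip(means, sds)]
    fit = sum(
        math.log(sum(p * c.pdf(x) for p, c in zip(props, comps)) + shifted_noise)
        for x in data
    )
    return (fit
            + g * math.log(shifted_noise)
            + (n + g) * math.log(n / (n + g))
            - g * math.log(f_max))
```

The bound guarantees a breakdown point above g/(n+g) for every g at which the best log-likelihood with fewer than s components stays strictly below this value.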

A more stable data constellation with two clusters is obtained when a 
(50, 1)-NSD of 25 points is added to the (0, 1)-NSD of the same size. The 
optimal solution for one cluster classifies one of the two NSD's as noise and 
the other one as the only cluster, while the optimal solution for two clusters 
again does not classify any point as noise. Equation (4.9) leads to a minimal 
breakdown point of 8/58 for the two-cluster solution. At least 11 outliers (at 
500, say) are needed for empirical breakdown. 

4.3. Unknown s. The treatment of the number of components s as un- 
known is favorable for robustness against outliers, because outliers can be 
fitted by additional mixture components. Generally, for large enough out- 
liers the addition of a new mixture component for each outlier yields a better 
log-likelihood than any essential change of the original mixture components. 
Thus, gross outliers are almost harmless, except that they let the estimated
number of components grow.

Breakdown may occur, however, because additional points inside the
range of the original data may lead to a solution with r < s clusters.
Equation (4.11) of Theorem 4.13 is sufficient (but rather conservative) for 
preventing this. Breakdown can also occur due to gross outliers alone, sim- 
ply because the number of outliers becomes so large that the BIC penalty, 
which depends on n, is increased by so much that the whole original dataset 
implodes into fewer than s clusters. The conditions for this are given in 
(4.13) for BIC, while it cannot happen for AIC because its penalty does not 
depend on n. 

Theorem 4.13. Let $\tau_n = (s, \eta_{n,s})$ be a maximizer of BIC. If

$$\min_{r<s}\Bigl[L_{n,s} - L_{n,r} - \tfrac{1}{2}(5g + 3s - 3r + 2n)\log(n+g) + n\log n\Bigr] > 0, \tag{4.11}$$

then

$$B_n(\tau_n, x_n) > \frac{g}{n+g}. \tag{4.12}$$

If

$$\min_{r<s}\Bigl[L_{n,s} - L_{n,r} - \tfrac{3}{2}(s-r)\log(n+g)\Bigr] < 0, \tag{4.13}$$

then

$$B_n(\tau_n, x_n) \le \frac{g}{n+g}. \tag{4.14}$$

Note that $L_{n,s} - L_{n,r} \ge \tfrac{3}{2}(s - r)\log n$ always holds by definition of BIC.
Sufficient conditions for breakdown because of "inliers" depend on the pa- 
rameters of certain suboptimal solutions for r < s mixture components for 
x n . They may be hard to derive and are presumably too complicated to be 
of practical use. 
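The two conditions of Theorem 4.13 depend on the data only through the log-likelihood differences $L_{n,s} - L_{n,r}$, so they are easy to evaluate once these are known. The Python sketch below assumes the bracketed expressions $L_{n,s} - L_{n,r} - \frac{1}{2}(5g + 3s - 3r + 2n)\log(n+g) + n\log n$ for (4.11) and $L_{n,s} - L_{n,r} - \frac{3}{2}(s-r)\log(n+g)$ for (4.13). The function names are hypothetical, and the value 20.087 used in the usage note is not reported in the paper: it is backed out from the difference 3.37 that Example 4.14 reports for g = 1.

```python
import math


def lower_bound_margin(delta_loglik, n, s, r, g):
    """Bracket of (4.11) for one r < s; positive for all r < s rules out breakdown by g points."""
    return (delta_loglik
            - 0.5 * (5 * g + 3 * s - 3 * r + 2 * n) * math.log(n + g)
            + n * math.log(n))


def gross_outlier_margin(delta_loglik, n, s, r, g):
    """Bracket of (4.13) for one r < s; negative means g gross outliers suffice."""
    return delta_loglik - 1.5 * (s - r) * math.log(n + g)


def min_gross_outliers(delta_loglik, n, s, r):
    """Smallest g for which the bracket of (4.13) becomes negative."""
    g = max(0, math.floor(math.exp(delta_loglik / (1.5 * (s - r))) - n))
    while gross_outlier_margin(delta_loglik, n, s, r, g) >= 0:
        g += 1
    return g
```

With n = 50, s = 2, r = 1 and the inferred difference 20.087, `lower_bound_margin` is positive for g = 1 and negative for g = 2, and `min_gross_outliers` returns a value of several hundred thousand, which illustrates why gross outliers alone are practically harmless when s is estimated by BIC.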

Example 4.14. Consider again the combination of a (0, 1)-NSD with 25 
points and a (5, 1)-NSD with 25 points, $f = \varphi$ and $\sigma_0$ chosen as in
Example 4.12. The difference in (4.11) is 3.37 for g = 1 and -7.56 for g = 2; that
is, the breakdown point is larger than 1/51. Many more points are empir- 
ically needed. Thirteen additional points, equally spaced between 1.8 and 
3.2, lead to a final estimation of only one mixture component (compare Fig- 
ure 1). It may be possible to find a constellation with fewer points where one 
component fits better than two or more components, but I did not find any. 
Breakdown because of gross outliers according to (4.13) needs more than 
650,000 additional points! 

A mixture of the (0, 1)-NSD with 25 points with a (50, 1)-NSD of size 25 
leads to a lower breakdown bound of 12/62. For estimated s, even a
breakdown point larger than 1/2 is possible, because new mixture components
can be opened for additional points. This may even happen empirically for
a mixture of (0, 1)-NSD and (50, 1)-NSD, because breakdown by addition 
of gross outliers is impossible unless their number is huge, and breakdown 
by addition of "inliers" is difficult. For a (0, 0.001)-NSD of 25 points and a 
(100,000, 0.001)-NSD of 25 points, even the conservative lower breakdown 
bound is 58/108 > 1/2. 

The choice of the $t_1$-distribution instead of the Normal leads to slightly
better breakdown behavior. The mixture of a 25-point (0, 1)-NSD and a
25-point (5, 1)-NSD yields a lower breakdown bound of 3/53, and empirically
the addition of the 13 inliers mentioned above does not lead to breakdown of 
one of the two components, but to the choice of three mixture components 
by the BIC. Replacement of the (5, 1)-NSD by a (50, 1)-NSD again gives a 
small improvement of the lower bound to 13/63. 

Remark 4.15. The possible breakdown point larger than 1/2 is a con- 
sequence of using the addition breakdown definition. A properly defined re- 
placement breakdown point can never be larger than the portion of points in 
the smallest cluster, because this cluster must be driven to break down if all 
of its points are suitably replaced. This illustrates that the correspondence 
between addition and replacement breakdown as established by Zuo (2001) 
may fail in more complicated setups. 

The addition of a noise component again does not change the breakdown 
behavior. 

Theorem 4.16. Under $f_{\max} > 1/(x_{\max,n} - x_{\min,n})$, Theorem 4.13 also
holds for global maximizers of BIC, defined so that (2.10) is maximized for
every fixed s.

Example 4.17. The discussed data examples of two components with 
25 points each do not lead to different empirical breakdown behavior with 
and without an estimated noise component according to (2.10), because no 
point of the original mixture components is classified as noise by the solu- 
tions for two Normal components. In the case of a (0, 1)-NSD of 45 points 
and a (5, 1)-NSD of 5 points, the solution with one Normal component, 
classifying the points from the smaller NSD as noise, is better than any 
solution with two components. That is, no second mixture component exists
which could break down. The same holds for $t_1$-mixtures (all points
form the only component), while NMML shows almost the same behavior
as in Example 4.14: there are two mixture components corresponding to the
two NSD's which can be joined by 12 equidistant points between 1.55 and
3.55. Equation (4.12) evaluates again to 1/51. More examples are given in
Hennig (2003). 






Remark 4.18. While parameter breakdown due to the loss of a mixture 
component implies classification breakdown of at least one cluster, classifi- 
cation breakdown may occur with fewer additional points than parameter 
breakdown. Consider again the (0, 1)-NSD of 45 points plus the (5, 1)-NSD 
of 5 points and NMML. The smaller cluster breaks down by the addition 
of six points, namely two points each exactly at the smallest and the two 
largest points of the (5, 1)-NSD. This leads to the estimation of five clusters, 
namely the original (0, 1)-NSD, three clusters of three identical points each, 
and the remaining two points of the (5, 1)-NSD. The fifth cluster is most 
similar to the original one with $\gamma = 4/7 < 2/3$, while no parameter breakdown
occurs. Thus, an arbitrarily large classification breakdown point is not pos- 
sible even for very well separated clusters, because not only their separation, 
but also their size matters. As in Section 4.2, the number of additional points 
required depends on $\sigma_0$.
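The similarity value in Remark 4.18 can be checked with a few lines of Python. The sketch assumes that the similarity between two clusters C and D is $\gamma(C, D) = 2|C \cap D| / (|C| + |D|)$; this form is not restated in the present section and is an assumption here, chosen because it reproduces the threshold 2/3 used in the proof of Lemma 3.3. The point labels 46 to 50 for the five points of the (5, 1)-NSD are arbitrary.

```python
def gamma_similarity(c, d):
    """Assumed similarity 2|C & D| / (|C| + |D|) between two clusters of point labels."""
    c, d = set(c), set(d)
    return 2 * len(c & d) / (len(c) + len(d))


# Hypothetical labels: 46..50 are the five points of the (5,1)-NSD.
original_small_cluster = {46, 47, 48, 49, 50}
# Points 46, 49 and 50 each form a cluster with two added copies, leaving 47 and 48.
fifth_cluster = {47, 48}
value = gamma_similarity(original_small_cluster, fifth_cluster)
```

Under this assumption the best match for the original five-point cluster has similarity 4/7, below 2/3, so the cluster counts as broken down in the classification sense even though no parameter explodes.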

5. Discussion. It has been shown that none of the discussed mixture 
model estimators is breakdown robust when the number of components s is 
assumed as known and fixed. An improvement can be achieved by adding 
an improper uniform distribution as an additional mixture component. 

The more robust way of estimating mixture parameters is the simulta- 
neous estimation of the number of mixture components s. Breakdown of 
mixture components may rather arise from the addition of points between 
the estimated mixture components of the original dataset than from gross 
outliers. It may be controversial if this is really a robustness problem. A sen- 
sible clustering method should be expected to reduce the estimated number 
of clusters if the gap between the clusters is filled with points, as long as 
their number is not too small. Compare Figure 1, where the NMML
estimate of s = 1 and the $t_1$-mixture estimate of s = 3 may both seem to be
acceptable. In such cases, the empirical breakdown point, or the more easily 
computable but conservative breakdown bound (4.12), may not be used to 
rule out one of the methods, but can rather be interpreted as a measure of 
the stability of the dataset with respect to clustering. 

While including the estimation of s leads to theoretically satisfying break- 
down behavior, robustness problems remain, in practice, because the global 
optimum of the log-likelihood has to be found. Consider, for example, a 
dataset of 1000 points, consisting of three well-separated clusters of 300 
points each and 100 extremely scattered outliers. The best solution requires 
103 clusters. Even for one-dimensional data, however, the EM-algorithm will 
be very slow for a large number of clusters, and there will be typically lots 
of local optima. Therefore, the maximum number of fitted components will 
often be much smaller than the maximum possible number of outliers and 
the results for fixed s remain relevant. The use of an improper noise
component or, if extremely huge outliers are ruled out, the proper noise component
or $t_1$-mixtures will clearly be superior to Normal mixtures with s estimated
but restricted to be small.

The comparison of the robustness characteristics of various cluster
analysis methods is an important topic and a first attempt is made in this paper
to define a classification breakdown point. It should not be the last word on 
the subject. Davies and Gather (2002) argue that a reasonable concept of a 
breakdown point should be linked to a sufficiently rich equivariance struc- 
ture to enable nontrivial upper bounds for the breakdown point. This does 
not hold for the concepts presented here, and it should be kept in mind that 
breakdown point definitions as those given here do not rule out meaningless 
estimators such as constants. The breakdown point should not be seen as the 
only important measure to judge the methods, but must be complemented 
by the consideration of their further properties. 

In some situations with low breakdown point in mixture modeling, ad- 
ditional outliers do not cause any substantial change unless they are huge 
(cf. Example 4.8, NNC in Section 4.2). More sensible measures than the 
breakdown point may be needed here. 

Neither MCLUST nor EMMIX is able to exactly reproduce the results 
given here. Neither allows the specification of a lower scale bound.
MCLUST produces an error if the EM-iteration leads to a sequence of vari- 
ance parameters converging to 0. This implies, in particular, that no single 
point can be isolated as its own mixture component. But such an isolation 
is crucial for the desirable breakdown behavior of the methods with esti- 
mated s. EMMIX terminates the iteration when the log-likelihood does not 
seem to converge. The preliminary iteration results, including one-point- 
components, are reported, but solutions with clear positive variances are fa- 
vored. Thus, the current implementations of the Normal mixture estimation 
with estimated s are essentially nonrobust. Addition of a noise component 
and t-mixtures perform better under outliers of moderate size, but they, too, 
are not robust against very extreme outliers. The results given here do not 
favor one of these two approaches over the other, and I think that the
implementation of a lower bound for the smallest covariance eigenvalue is a
more important issue than the decision between the current implementations.

Note that both packages enable the use of stronger scale restrictions 
(equivalent to equal variances for all mixture components in the one-dimensional 
case), which should have roughly the same robustness characteristics for
estimated s as the methods considered here. However, in practice such
restrictions are often not justified.

APPENDIX 

A.1. Choice of the tuning parameters $\sigma_0$ and b. For the choice of $\sigma_0$, the
following strategy is proposed. As a "calibration benchmark," form a dataset
with n points by adding an $\alpha_n$-outlier to a (0, 1)-NSD (recall Definition 4.9)
with n - 1 points. Davies and Gather (1993) define "$\alpha$-outliers" (with $\alpha > 0$
but very small) with respect to an underlying model as points from a region
of low density, chosen so that the probability of the occurrence of an outlier
is equal to $\alpha$ under that model. For a standard Normal distribution, for
example, the points outside $[\Phi^{-1}(\alpha/2), \Phi^{-1}(1 - \alpha/2)]$ are the $\alpha$-outliers. For
$\alpha_n = 1 - (1 - p)^{1/n}$, the probability of the occurrence of at least one
$\alpha_n$-outlier among n i.i.d. points is equal to p. Take p = 0.95, say.
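The calibration quantities of this paragraph can be computed directly. The following Python sketch uses hypothetical helper names (`alpha_n`, `outlier_region_bounds`) and the standard Normal model; it only evaluates the formulas $\alpha_n = 1 - (1-p)^{1/n}$ and $[\Phi^{-1}(\alpha/2), \Phi^{-1}(1-\alpha/2)]$ and does not construct the benchmark dataset, because the exact position of the added outlier is not specified here.

```python
from statistics import NormalDist


def alpha_n(n, p=0.95):
    """Per-point outlier probability such that P(at least one outlier among n points) = p."""
    return 1.0 - (1.0 - p) ** (1.0 / n)


def outlier_region_bounds(alpha, dist=NormalDist()):
    """Interval [Phi^{-1}(alpha/2), Phi^{-1}(1 - alpha/2)]; points outside are alpha-outliers."""
    return dist.inv_cdf(alpha / 2.0), dist.inv_cdf(1.0 - alpha / 2.0)


lower, upper = outlier_region_bounds(alpha_n(50))
```

By construction, `1 - (1 - alpha_n(n)) ** n` recovers p, which is the defining property of $\alpha_n$.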

Consider NMML with estimated s under (2.11) (this seems to lead to
reasonable values for all methods discussed in the present paper). Let
$c_0 = \sigma_0$ for this particular setup. Choose $c_0$ so that C(1) = C(2) according to
(2.20). This can be carried out in a unique way because $L_{n,1}(\eta_{n,1})$ does
not depend on $c_0$ (as long as $c_0$ is smaller than the sample variance) and
$L_{n,2}(\eta_{n,2})$ increases with decreasing $c_0$, because this enlarges the parameter
space. For $c_0$ small enough, the two-component solution will consist of one
component matching approximately the ML-estimator for the NSD, $a_2$ will
approximately equal the outlier and $\sigma_2 = c_0$, so that the increase in
$L_{n,2}(\eta_{n,2})$ becomes strict.

Now use $\sigma_0 = c_0\sigma_{\max}$, where $\sigma^2_{\max}$ is the largest variance such that a data
subset with this variance can be considered as a "cluster" with respect to
the given application. At least, if the mixture model is used as a tool for
cluster analysis, points of a cluster should belong together in some sense,
and, with regard to a particular application, it can usually be said that
points above a certain variation can no longer be considered as "belonging
together." Therefore, in most applications it is possible to choose $\sigma_{\max}$ in
an interpretable manner, while this does not work for $\sigma_0$ directly.

The rationale is that a sensible choice of $\sigma_0$ should lead to the estimation
of the dataset as one component, if it does not contain any outlier in the 
sense of Davies and Gather (1993). If the nth point is an outlier, it should be 
fitted by a new mixture component. The reader is referred to Hennig (2003) 
for a more detailed discussion. 

Given $\sigma_{\max}$, the improper density value b for maximization of (4.8) can be
chosen as the density value at the 0.025-quantile of $f_{0,\sigma_{\max}}$, so that at least
95% of the points generated from a "cluster-generating" mixture component
have a larger density value for their own parent distribution than for the
noise component. In all examples $\sigma_{\max} = 5$ has been used, which leads to
$\sigma_0 = 0.025$, b = 0.0117.
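The rule for b is easy to reproduce for Normal components. The sketch below (the function name `improper_density_value` is illustrative) evaluates the $N(0, \sigma_{\max}^2)$ density at its 0.025-quantile, treating $\sigma_{\max}$ as a standard deviation.

```python
from statistics import NormalDist


def improper_density_value(sigma_max, q=0.025):
    """Density of N(0, sigma_max^2) at its q-quantile, used as the noise constant b."""
    cluster = NormalDist(0.0, sigma_max)
    return cluster.pdf(cluster.inv_cdf(q))


b = improper_density_value(5.0)
```

For `sigma_max = 5.0` the result rounds to the value 0.0117 used in the examples. By symmetry, all points between the 0.025- and 0.975-quantiles of the cluster distribution have a density above b, which gives the stated 95%.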

Note that the theory in Section 4 assumes $\sigma_0$ as constant over n, so that
it does not directly apply to the suggestion given here. 

Under (2.12), $c = c_0$ can be used because of scale equivariance, avoiding
the specification of $\sigma_{\max}$. However, (2.12) does not properly generalize to
fitting of a noise component and estimation of the number of components 
(the latter can be done by the choice of a suitable upper bound on s).

Lemma A.1. The following objective functions are unbounded from above
under the restriction (2.12):

1. the log-likelihood function (2.10) with fixed s;

2. the AIC and BIC of model (2.1) with unknown $s \in \mathbb{N}$.

Proof. Consider an arbitrary dataset $x_1, \ldots, x_n$. For (2.10) choose
$a_1 = x_1$, $\pi_1 > 0$, $\sigma_1 \to 0$, $\pi_0 > 0$. This implies that the summand for $x_1$
converges to $\infty$ while all others are bounded from below by
$\log(\pi_0/(x_{\max,n} - x_{\min,n}))$. This proves part (1). For part (2) choose s = n,
$a_1 = x_1, \ldots, a_s = x_n$, $\sigma_1 = \cdots = \sigma_s \to 0$. Thus, $L_{n,s} \to \infty$, and the same
holds for AIC and BIC. □
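Part (1) of the lemma can be illustrated numerically. The Python sketch below assumes that the noise-component log-likelihood has the form $\sum_i \log(\pi_1 f_{a_1,\sigma_1}(x_i) + \pi_0/(x_{\max,n} - x_{\min,n}))$ with a single Normal component centered at the first data point; the function name `noise_loglik_spike` and the toy dataset are illustrative only.

```python
import math
from statistics import NormalDist


def noise_loglik_spike(data, sigma1, pi1=0.5):
    """Log-likelihood with one Normal component at data[0] plus uniform noise over the range."""
    noise_density = (1.0 - pi1) / (max(data) - min(data))
    comp = NormalDist(data[0], sigma1)
    return sum(math.log(pi1 * comp.pdf(x) + noise_density) for x in data)


toy = [0.0, 1.0, 2.0, 5.0]
values = [noise_loglik_spike(toy, s) for s in (1e-2, 1e-4, 1e-8)]
```

The entries of `values` increase as `sigma1` shrinks: the summand for the first point grows like $-\log \sigma_1$, while the noise term keeps every other summand bounded from below. This is why a lower bound on the scale parameters is needed.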

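A.2. Proofs.

Proof of Lemma 2.2. For any fixed $\sigma_j^*$, the maximizer $a_j$ of (2.17)
lies between $x_{\max,n}$ and $x_{\min,n}$ because of (2.2) and (2.3). Now show that

$$\sigma_j^* \le \frac{\sigma_0 f(0)}{f((x_{\max,n} - x_{\min,n})/\sigma_0)}.$$

By $\sigma_j^* = \sigma_0$,

$$S_j(a_j, \sigma_j^*) \ge \sum_{i=1}^{n} p_{ij} \log\frac{1}{\sigma_0} f\Bigl(\frac{x_i - a_j}{\sigma_0}\Bigr) \ge n\pi_j \log\frac{1}{\sigma_0} f\Bigl(\frac{x_{\max,n} - x_{\min,n}}{\sigma_0}\Bigr).$$

For arbitrary $\sigma_j^*$,

$$S_j(a_j, \sigma_j^*) \le n\pi_j\bigl(\log f(0) - \log\sigma_j^*\bigr).$$

Therefore,

$$\log f(0) - \log\sigma_j^* \ge \log\frac{1}{\sigma_0} f\Bigl(\frac{x_{\max,n} - x_{\min,n}}{\sigma_0}\Bigr) \;\Rightarrow\; \sigma_j^* \le \frac{\sigma_0 f(0)}{f((x_{\max,n} - x_{\min,n})/\sigma_0)}$$

as long as $n\pi_j > 0$, proving (2.18). □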

Proof of Lemma 3.3. Recall (3.2). For given C, F can always be
chosen to contain $C_1, \ldots, C_r$, $r \ge 2$, with $C \subseteq \bigcup_{i=1}^{r} C_i$ such that $|C_i| \le |C|/2$
$\forall i$ for even |C|. For odd |C|, $C_1 \in F$ with $|C_1 \cap C| = (|C| + 1)/2$ and
$|C_1 \setminus C| \ge 1$ can be constructed such that $\gamma^*(C, F) = \gamma(C, C_1) \le 2/3$. On the other
hand, $\forall F \in \mathcal{F}$: $\gamma^*(C, F) \ge 2/3$ if $x_n = C \cup \{x\}$ and $S = \{2\}$. □

Proof of Lemma 4.1. Note first that in case of maximizing (2.10) the
density of the noise component $1/(x_{\max,n+g} - x_{\min,n+g})$ converges to 0, so
that all arguments, including those used in the proofs of Lemmas 4.2 and 4.3,
hold for this case too.

Assume w.l.o.g. that all $a_{jm}$, j = 1, ..., s, are outside $[x_1 - d, x_{n_1} + d]$ for
arbitrary $d < \infty$ and m large enough unless $\pi_{jm} \searrow 0$ or $\sigma_{jm} \nearrow \infty$ at least
for a subsequence of $m \in \mathbb{N}$. Consider

$$L_{n,s}(\eta_m, x_{nm}) = \sum_{i=1}^{n_1} \log\Biggl(\sum_{j=1}^{s} \pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i)\Biggr) + \sum_{i=n_1+1}^{n} \log\Biggl(\sum_{j=1}^{s} \pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i)\Biggr).$$

The first sum converges to $-\infty$ for $m \to \infty$ because of (2.7), and the second
sum is bounded from above by $(n - n_1)\log(f(0)/\sigma_0)$, that is,
$L_{n,s}(\eta_m, x_{nm}) \to -\infty$. In contrast, for $\hat\eta_m$ with $\hat a_{km} = x_{n_k m}$,
$\hat\sigma_{km} = \sigma_0$, $\hat\pi_{km} = 1/h$, k = 1, ..., h,

$$L_{n,s}(\hat\eta_m, x_{nm}) \ge \sum_{k=1}^{h} n_k \log\frac{f((x_{n_k m} - x_{(n_{k-1}+1)m})/\sigma_0)}{h\sigma_0} \ge \sum_{k=1}^{h} n_k \log\frac{f(b/\sigma_0)}{h\sigma_0} > -\infty.$$

Hence, for m large enough, $\eta_m$ cannot be ML. Since it should be ML, d has
to exist so that (4.1) holds for m larger than some $m_0$. □

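Proof of Lemma 4.2.

Proof of (4.3). Suppose that (4.3) does not hold. Without loss of
generality [the order of the $a_j$ does not matter and a suitable subsequence of
$(\eta_m)_{m \in \mathbb{N}}$ can always be found] assume

$$\lim_{m \to \infty} \min\{|x - a_{1m}| : x \in \{x_{1m}, \ldots, x_{nm}\}\} = \infty.$$

Due to (2.7),

$$\frac{1}{\sigma_{1m}} f\Bigl(\frac{x_{im} - a_{1m}}{\sigma_{1m}}\Bigr) \to 0 \quad \forall i.$$

With (2.6) and (4.1),

$$\sum_{j=2}^{s} \pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i) \ge d_{\min} > 0, \quad i = 1, \ldots, n,$$

where $d_{\min}$ is the product of $\pi_{\min}$ and a positive lower bound for the density
values of the components located near the data. Thus, for arbitrarily small
$\varepsilon > 0$ and m large enough,

$$L_{n,s}(\eta_m, x_{nm}) \le \sum_{i=1}^{n} \log\Biggl(\sum_{j=2}^{s} \pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i)\Biggr) + n\bigl(\log(d_{\min} + \varepsilon) - \log d_{\min}\bigr),$$

and $\log(d_{\min} + \varepsilon) - \log d_{\min} \searrow 0$ for $\varepsilon \searrow 0$. Thus, $L_{n,s}$ can be increased
for $\varepsilon$ small enough by replacement of $(\pi_{1m}, a_{1m}, \sigma_{1m})$ by $(\pi_{1m}, x_1, \sigma_0)$ in
contradiction to $\eta_m$ being ML.

Proof of (4.4) by analogy to (4.3). Suppose that w.l.o.g. $\sigma_{1m} \to \infty$. Then

$$\frac{1}{\sigma_{1m}} f\Bigl(\frac{x_{im} - a_{1m}}{\sigma_{1m}}\Bigr) \to 0 \quad \forall i,$$

and replacement of $(\pi_{1m}, a_{1m}, \sigma_{1m})$ by $(\pi_{1m}, x_1, \sigma_0)$ increases the
log-likelihood. □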

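Proof of Lemma 4.3.

Proof of (4.5). Consider $k \in \{1, \ldots, h\}$. Let
$S_k = [x_{(n_{k-1}+1)m} - d, x_{n_k m} + d]$. With Lemma 2.2,

$$\sum_{a_{jm} \in S_k} \pi_{jm} = \frac{1}{n} \sum_{i=1}^{n} \sum_{a_{jm} \in S_k} \frac{\pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i)}{\sum_{l=1}^{s} \pi_{lm} f_{a_{lm},\sigma_{lm}}(x_i)}.$$

For $a_{jm} \in S_k$ and $m \to \infty$,

$$\frac{\pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i)}{\sum_{l=1}^{s} \pi_{lm} f_{a_{lm},\sigma_{lm}}(x_i)} \to 0 \quad \text{for } i \notin D_k,$$

while, for $i \in D_k$,

$$\sum_{a_{jm} \in S_k} \frac{\pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i)}{\sum_{l=1}^{s} \pi_{lm} f_{a_{lm},\sigma_{lm}}(x_i)} \to 1.$$

This yields $\sum_{a_{jm} \in S_k} \pi_{jm} \to |D_k|/n$ [at least one of the $\pi_{jm}$ in this sum is
bounded away from 0 by (4.1)].

Proof of (4.6). Let $\eta_{kmq} = \arg\max_{\eta} L_{|D_k|,q}(\eta, y_{km})$, $q \in \mathbb{N}$,

$$L_{q_1 \cdots q_h m} = \sum_{k=1}^{h} \Bigl[L_{|D_k|,q_k}(\eta_{kmq_k}) + |D_k| \log\frac{|D_k|}{n}\Bigr].$$

Note that $L_{n,s}(\eta_m) \ge \max_{\sum_{k=1}^{h} q_k = s} L_{q_1 \cdots q_h m}$ can be proved by choice of $\eta$
according to $\pi_j = (|D_j|/n)\pi_{jmq_j}$, $a_j = a_{jmq_j}$, $\sigma_j = \sigma_{jmq_j}$, j = 1, ..., h. Further,
for m large enough and arbitrarily small $\varepsilon > 0$,

$$L_{n,s}(\eta_m) \le \sum_{k=1}^{h} \sum_{i \in D_k} \log\Biggl(\sum_{a_{jm} \in S_k} \pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i)\Biggr) + \varepsilon, \tag{A.1}$$

because, for $x_i$, $i \in D_k$, the sum over $a_{jm} \in S_k$ is bounded away from 0 as
shown in the proof of Lemma 4.1, while the sum over
$a_{jm} \in [x_{(n_{l-1}+1)m} - d, x_{n_l m} + d]$, $l \ne k$, vanishes for $m \to \infty$. Further, find

$$\sum_{i \in D_k} \log\Biggl(\sum_{a_{jm} \in S_k} \pi_{jm} f_{a_{jm},\sigma_{jm}}(x_i)\Biggr) - |D_k| \log\Biggl(\sum_{a_{jm} \in S_k} \pi_{jm}\Biggr) \le L_{|D_k|,q}(\eta_{kmq}),$$

where $q = |\{a_{jm} \in S_k\}|$. Now (4.6) follows from (4.5). □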

Proof of Theorem 4.4. Let $x_{(n+r)m} = (x_1, \ldots, x_n, x_{(n+1)m}, \ldots, x_{(n+r)m})$,
$m \in \mathbb{N}$. W.l.o.g. let $x_1 \le \cdots \le x_n$, $x_{(n+k)m} = x_n + km$, k = 1, ..., r. This
satisfies the assumptions of Lemma 4.1 for h = r + 1, so that the location
parameters for r components have to converge to $\infty$ with $x_{(n+1)m}, \ldots, x_{(n+r)m}$.
□



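Proof of Theorem 4.11. Let $x_{n+g} = (x_1, \ldots, x_{n+g})$. Let
$\xi^* = \xi_{n+g,s} = \arg\max_{\xi} L_{n+g,s}(\xi, x_{n+g})$. For r < s,

$$L_{n+g,s} \le \sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{r} \pi_j^* f_{\theta_j^*}(x_i) + \sum_{j=r+1}^{s} \pi_j^* f_{\theta_j^*}(x_i) + \pi_0^* b\Biggr) + g \log f_{\max}.$$

Assume that the parameter estimators of s - r (i.e., at least one) mixture
components leave a compact set D of the form $D = [\pi_{\min}, 1] \times C$,
$C \subset \mathbb{R} \times \mathbb{R}^+$ compact, $\pi_{\min} > 0$. Let the mixture components be ordered such
that $(\pi_j^*, a_j^*, \sigma_j^*) \in D$ only for j = 1, ..., r < s. From (2.6),
$\sum_{j=1}^{r} \pi_j^* f_{\theta_j^*}(x_i) \ge r\pi_{\min} f_{\min}$, while $\sum_{j=r+1}^{s} \pi_j^* f_{\theta_j^*}(x_i)$ becomes
arbitrarily small for D large enough by (2.7). Thus, for arbitrary $\varepsilon > 0$ and
D large enough,

$$L_{n+g,s} \le \sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{r} \pi_j^* f_{\theta_j^*}(x_i) + \pi_0^* b\Biggr) + g \log f_{\max} + \varepsilon \le \max_{r<s} L_{n,r} + g \log f_{\max} + \varepsilon. \tag{A.2}$$

However, $\hat\xi$ could be defined by $\hat\pi_0 = (n\pi_0 + g)/(n + g)$,
$\hat\pi_j = (n/(n+g))\pi_j$, $\hat a_j = a_j$, $\hat\sigma_j = \sigma_j$, j = 1, ..., s. Therefore,

$$L_{n+g,s} \ge \sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{s} \pi_j f_{\theta_j}(x_i) + \Bigl(\pi_0 + \frac{g}{n}\Bigr) b\Biggr) + g \log\Bigl(\Bigl(\pi_0 + \frac{g}{n}\Bigr) b\Bigr) + (n+g)\log\frac{n}{n+g},$$

and together with (A.2),

$$\max_{r<s} L_{n,r} \ge \sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{s} \pi_j f_{\theta_j}(x_i) + \Bigl(\pi_0 + \frac{g}{n}\Bigr) b\Biggr) + g \log\Bigl(\Bigl(\pi_0 + \frac{g}{n}\Bigr) b\Bigr) + (n+g)\log\frac{n}{n+g} - g \log f_{\max} - \varepsilon.$$

This contradicts (4.9) by $\varepsilon \to 0$. □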



Proof of Theorem 4.13. Add points $x_{n+1}, \ldots, x_{n+g}$ to $x_n$. Let $C_m(s, \eta)$
be the value of BIC for s mixture components and parameter $\eta$, applied to
the dataset $x_m$, $m \ge n$. Let $C_m(s)$ be its maximum. With the same arguments
as those leading to (A.2), construct for arbitrary $\varepsilon > 0$ a suitably large
compact $C \subset \mathbb{R} \times \mathbb{R}^+$, containing the location and scale parameters of all
mixture components of $\tau = (s, \eta) = (s, \eta_{n,s})$, and assume that $(a_j^*, \sigma_j^*) \in C$
for only r < s components of $\tau^* = (s^*, \eta^*) = \arg\max C_{n+g}(s, \eta)$. We get

$$C_{n+g}(s^*) \le 2\sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{r} \pi_j^* f_{\theta_j^*}(x_i)\Biggr) + 2g \log f_{\max} + \varepsilon - (3s^* - 1)\log(n+g), \tag{A.3}$$

and, by taking $\hat s = s + g$, $\hat\pi_j = (n/(n+g))\pi_j$, j = 1, ..., s,
$\hat\pi_{s+1} = \cdots = \hat\pi_{s+g} = 1/(n+g)$, $\hat\theta_j = \theta_j$, j = 1, ..., s,
$\hat a_{s+k} = x_{n+k}$, $\hat\sigma_{s+k} = \sigma_0$, k = 1, ..., g,

$$C_{n+g}(s^*) \ge 2\sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{s} \frac{n}{n+g}\pi_j f_{\theta_j}(x_i)\Biggr) + 2g \log\frac{f_{\max}}{n+g} - (3(s+g) - 1)\log(n+g). \tag{A.4}$$

By combination,

$$\sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{s} \pi_j f_{\theta_j}(x_i)\Biggr) - \sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{r} \frac{\pi_j^*}{\sum_{k=1}^{r} \pi_k^*} f_{\theta_j^*}(x_i)\Biggr) - \varepsilon$$

$$\le g\log(n+g) - \frac{3}{2}\bigl(s^* - (s+g)\bigr)\log(n+g) - n\log\frac{n}{n+g} + n\log\sum_{k=1}^{r} \pi_k^*$$

$$\le \frac{1}{2}(5g + 3s - 3r + 2n)\log(n+g) - n\log n.$$

Under (4.11) this cannot happen for arbitrarily small $\varepsilon$.

A sufficient condition for breakdown can be derived by explicit
contamination. Let $y = x_{n+1} = \cdots = x_{n+g}$. For fixed $\hat s$, it follows from Lemma 4.3
that, for $y \to \infty$, $C_{n+g}(\hat s)$ converges to

$$2\Bigl(L_{n,\hat s-1} + g\log f_{\max} + n\log\frac{n}{n+g} + g\log\frac{g}{n+g}\Bigr) - (3\hat s - 1)\log(n+g).$$

This cannot be maximized by $s^* = \hat s > s + 1$ because the penalty on s is
larger for n + g points than for n points and $s^* - 1$ with parameters
maximizing $L_{n,s^*-1}(\eta, x_n)$ must already be a better choice than s for n points
unless $s^* \le s + 1$. It follows that the existence of r < s with

$$2L_{n,s} - (3(s+1) - 1)\log(n+g) < 2L_{n,r} - (3(r+1) - 1)\log(n+g)$$

suffices for breakdown of at least one component, which is equivalent to
(4.13). □

Proof of Theorem 4.16. Let

$$d = \frac{1}{x_{\max,n} - x_{\min,n}}, \qquad d^* = \frac{1}{x_{\max,n+g} - x_{\min,n+g}}.$$

Replace (A.3) by

$$C_{n+g}(s^*) \le 2\sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{r} \pi_j^* f_{\theta_j^*}(x_i) + \pi_0^* d^*\Biggr) + 2g\log f_{\max} + \varepsilon - (3s^* - 1)\log(n+g), \tag{A.5}$$

and (A.4) by

$$C_{n+g}(s^*) \ge 2\sum_{i=1}^{n} \log\Biggl(\sum_{j=1}^{s} \frac{n}{n+g}\pi_j f_{\theta_j}(x_i) + \frac{n}{n+g}\pi_0 d\Biggr) + 2g\log\frac{f_{\max}}{n+g} - (3(s+g) - 1)\log(n+g).$$

Equation (4.12) follows from $d \ge d^*$ in (A.5). □

Lemma 4.3 holds as well for maximizers of (2.10), and therefore (4.14)
carries over as well.

Acknowledgments. I thank a referee and Vanessa Didelez for helpful 
comments.